Blue Bikes, formerly known as Hubway, is a public bike-sharing system across the metropolitan areas of Boston. Participating individuals can either be Blue Bike members (including annual or monthly members) or casual single-trip day pass users. Blue Bikes publishes downloadable files of trip user data each quarter, and for this analysis, I will be using the compiled 2019 dataset from Kaggle in addition to the station dataset published by Blue Bikes.
The data set from Kaggle was originally retrieved from Blue Bikes and then compiled and collected for the year 2019. Blue Bikes also publishes downloadable files of station data, I will be joining the 2019 trip data from Kaggle with the station data to analyze municipality data. Both datasets used in this analysis was downloaded as CSV files and read as a dataframe in R.
By leveraging this comprehensive data set, I aim to uncover insightful patterns associated with bike rental usage in Boston. For this analysis, I am interested in taking a deeper look at short trips (trip duration of 2 hours or less). It’s noteworthy that the overwhelming majority of data in the Blue Bikes dataset for 2019 comprises trips lasting under 2 hours. Trips exceeding this threshold accounted for less than 1% of the data, specifically 0.649%.
This investigation will encompass various areas, including the examination of trip duration across different districts, the identification of key districts driving Blue Bike rental demand, and the assessment of the distribution of total trip volume throughout the year. Additionally, I have conducted an analysis of applying statistical principles such as the central limit theorem to gain a deeper understanding of the distribution of trip duration. I will also explore various sampling methods to determine which provides the most accurate representation of our population data.
In the ‘Analysis’ section of this document, you will encounter a variety of plots, visualizations, and analyses focusing on different aspects of the Blue Bikes dataset. Within this section, I delve into the intricacies of the dataset and uncover underlying patterns.
The user trip dataset includes 17 columns and over 2 million rows while the Blue Bikes station dataset includes 5 columns and 421 rows. For this analysis, I am interested in understanding patterns for those individuals who rented a bike for short trips, specifically 2 hours or less. To ensure the integrity of our analysis, I will be excluding instances of longer rental durations from the dataset.
New columns I will be adding:
tripduration_minutes : The dataset currently only has a
trip duration column in seconds, I will be creating a new column to get
the duration in minutesday : The data only has month and year column. I will
be creating a day columndate : This will be the date of started trip without a
time stampIn order to provide a concise analysis of some of these attributes,
cleaning up the data is necessary. This includes changing data types and
removing unneeded rows (trips over 2 hours). I will also be adding a new
column tripduration_minutes which uses the
tripduration column (shown in seconds) to convert trip
duration to minutes. Finally, I will be joining the 2019 trip data set
with the station dataset using the station name variable from both
tables. After the tables are joined, I will delete duplicate
columns.
I used the following R libraries/packages to conduct this analysis:
readr, knitr, kableExtra,
tidyverse, plotly, sampling and
leaflet
Note: Some values may produce an NA value when joining. This may be because the station data is more recently updated than the trip data set (from 2019). When analyzing this data in our analysis, later on, we will omit NA.
Source: Information on the attributes found on Blue Bikes and Kaggle
tripduration : duration of a bike trip in secondstripduration_minutes : column added for analysis.
duration of bike trip in minutesstarttime: timestamp of start time of tripstoptime : timestamp of end time of tripstart station id : unique stationID where trip
startedstart station name : name of the station at start of
tripstart station latitude : latitude of start stationstart station longitude : longitude of start
stationend station id : unique stationID where trip endedend station name : name of station at end of tripend station loatitude : latitude of end stationend station longitude : longitude of end stationbikeid : unique ID of bike used for the tripusertype : Customer (casual single trip or day pass
user) or Subscriber (annual or monthly member)year : year when trip took place, for our data, this
will all be 2019month : month when trip took place (numerical
1-12)day : created column, day of month trip took placedate : created column, date of trip in YYYY-MM-DD
formatbirth year : birth year of user, this is self
reportedgender : gender of user, this is self reportedstart_district : district at start of trip. Boston,
Brookline Cambridge, Everett, Somerville or NAstart_total_docks : total number of docs at start of
tripend_district : district at end of trip. Boston,
Brookline Cambridge, Everett, Somerville or NAend_total_docks : total number of docs at end of
tripBelow, you will find the top 100 rows in the dataset we will be using to analyze Blue Bike trips across the Boston Metropolitan areas
The distribution of Bike Rentals throughout the year is not at all surprising. We can see that September is the most popular month for Blue Bike rentals and January is recorded as the lowest. This trend suggests a general preference for biking during warmer months and a decrease during colder ones, which is understandable given Boston’s chilly winters.
The scatter plot below illustrates daily Blue Bike rental counts, with each point representing the number of rentals for that day. The trend shows a positive incline towards the warmer months and a negative trend towards the colder months.
daily_count <- bluebikes %>%
group_by(date) %>%
summarize(count = n())
daily_count_scatter <- plot_ly(x = daily_count$date, y = daily_count$count, type = "scatter", mode = "markers") %>%
layout(title = "Daily Counts of Bike Bike Rentals in 2019",
xaxis = list(title = "Date"),
yaxis = list(title = "Bike Rentals"),
font = list(family = "Proxima Nova, sans-serif", size = 16, color = "black"),
margin = list(t = 50, b = 50))
daily_count_scatter
Below you will find the distribution of daily Blue Bike rental averages by month. Each bar represents a specific month in the year 2019. The height of each bar represents the average daily bike rental for that month. The shape of this distribution follows a similar pattern to the scatter plot of daily rentals throughout 2019.
monthly_avg <- bluebikes %>%
group_by(month, day) %>%
summarize(daily_trips = n()) %>%
group_by(month) %>%
summarize(average = mean(daily_trips))
monthly_avg_bar <- plot_ly(x = monthly_avg$month, y = monthly_avg$average, type = "bar") %>%
layout(
title = "Daily Blue Bike Rental Averages by Month",
xaxis = list(
tickmode = "array",
tickvals = 1:12,
ticktext = month.abb),
yaxis = list(title = "Bike Rental Average"),
font = list(family = "Proxima Nova, sans-serif", size = 16, color = "black"),
margin = list(t = 50, b = 50))
monthly_avg_bar
In this section of our analysis, we will examine the district data provided by Blue Bikes, encompassing five districts: Boston, Cambridge, Everett, Brookline, and Somerville. We will exclude any NA values from our analysis.
The first tab will showcase the distribution of bike rentals throughout the year for each district. In the second tab, we will explore the total number of bike rentals or trips in each district for the year 2019. The third tab will focus on the distribution of trip duration in minutes for each district throughout the year. Lastly, in the fourth tab, we will delve into the distribution of trip duration in minutes specifically for Everett, MA.
The time series of total daily bike rides throughout the year by district is shown below. In Blue, we predictably find Boston to generally have the highest number of bike rides throughout the year in comparison with the other districts. If you take a look at the interactive plot, you will find that the shape of the distributions are consistent in all the districts (with the exception of Everett, MA which is introduced in Spring 2019).
I use the following attributes to create this line chart below:
start_district, date and daily trip count.
district_trips_rides<- bluebikes %>%
filter(!is.na(start_district)) %>%
group_by(date, start_district) %>%
summarize(daily_trips = n()) %>%
group_by(start_district)
district_trips_rides <- district_trips_rides %>%
pivot_wider(names_from = start_district, values_from = daily_trips)
district_trips_rides <- district_trips_rides %>%
arrange(date)
fig_district_trips <- plot_ly(data = district_trips_rides, x = ~date) %>%
add_trace(y = ~Boston, name = "Boston", type = "scatter", mode = "lines") %>%
add_trace(y = ~Cambridge, name = "Cambridge", type = "scatter", mode = "lines") %>%
add_trace(y = ~Somerville, name = "Somerville", type = "scatter", mode = "lines") %>%
add_trace(y = ~Brookline, name = "Brookline", type = "scatter", mode = "lines") %>%
add_trace(y = ~Everett, name = "Everett", type = "scatter", mode = "lines")%>%
layout(xaxis = list(title = "Date"),
yaxis = list(title = "Number of Trips"))
fig_district_trips
Let’s look at the total number of trips for the year 2019 by each starting district (where the trip started). Below you will find a bar chart and pie chart of the distributions by starting district. Both charts show that the city of Boston makes up the greatest proportion of total trips, 55% of the data, while Cambridge is the runner-up. The district with the lowest proportion is Everett, making up about 0.219 % of the data which is 4,905 trips. This conclusion is consistent with our findings analalyzed in the time series.
I use the following attributes to create the bar chart below:
start_district and daily trip count
district_trips <- bluebikes %>%
filter(!is.na(start_district)) %>%
group_by(start_district) %>%
summarize(daily_trips = n())
fig_district_counts <- plot_ly(district_trips, x = ~start_district, y = ~daily_trips, type = 'bar',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5)))
fig_district_counts <- fig_district_counts %>%
layout(title ="Total Number of Blue Bike Trips In Each District in 2019",
xaxis = list(title = "District At Start of Trip"),
yaxis = list(title = "Total Bike Rentals"))
fig_district_counts
I use the following attributes to create this piechart:
start_district and daily trip count
colors <- c('rgb(20,2,211)', 'rgb(128,133,133)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')
fig_district_counts_pie <- plot_ly(district_trips,
labels = ~start_district, values = ~daily_trips,
type = 'pie', pull = c(0.05, 0.025, 0.05, 0.2, 0.1),
marker = list(colors = colors))
fig_district_counts_pie <- fig_district_counts_pie %>%
layout(title = 'Proportions of Total Trips By Starting District',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig_district_counts_pie
Although the starting district of Everett had the lowest proportion of total trips in 2019, it has a significantly higher average trip duration (in minutes). This may be because where Everett is located. We can also see that the line for Everett starts in June, and after further research, I found that Blue Bikes was first introduced to Everett, MA in Spring of 2019. This also explains the low proportion of trips taken in 2019
In the line plot below, we find that the trip duration on average for Everett was consistently higher than the other districts.
I am using the following attributes to create the visualization
below: start_district, month and
tripduration_minutes
district_trips_duration <- bluebikes %>%
filter(!is.na(start_district)) %>%
group_by(month, start_district) %>%
summarize(avg_trip_duration = mean(tripduration_minutes)) %>%
group_by(start_district)
district_trips_duration <- district_trips_duration %>%
pivot_wider(names_from = start_district, values_from = avg_trip_duration)
district_trips_duration <- district_trips_duration %>%
arrange(month)
fig_district_trips_duration <- plot_ly(data = district_trips_duration, x = ~month) %>%
add_trace(y = ~Boston, name = "Boston", type = "scatter", mode = "lines") %>%
add_trace(y = ~Cambridge, name = "Cambridge", type = "scatter", mode = "lines") %>%
add_trace(y = ~Somerville, name = "Somerville", type = "scatter", mode = "lines") %>%
add_trace(y = ~Brookline, name = "Brookline", type = "scatter", mode = "lines") %>%
add_trace(y = ~Everett, name = "Everett", type = "scatter", mode = "lines")%>%
layout(xaxis = list(title = "Date", tickmode = "array",
tickvals = 1:12,
ticktext = month.abb),
yaxis = list(title = "Average Trip Duration"))
fig_district_trips_duration
Below, you will find a box plot of the distribution of trip duration across all districts. As you can see, they all are right skwed with a significant amount of outliers. Everett has a slightly higher median, at 18 minutes, in comparison with the other districts.
district_trips_duration_boxplot <- bluebikes %>%
filter(!is.na(start_district))
district_trips_duration_boxplot |>
plot_ly(x = ~tripduration_minutes, color = ~start_district, type = "box")
From what we found in the trip duration time series analysis, Everett appeared to have the highest on average trip duration in comparison with the other districts.
Now, let’s examine the spread of the numerical attribute, trip duration (in minutes), for Everett, MA. The distribution reveals a right skew, accompanied by several outliers. The median trip duration in Everett is approximately 18 minutes. This suggests that the averages shown in the line plot were likely influenced by outliers.
everett_trips_duration <- bluebikes %>%
filter(start_district == "Everett")
everett_trip_duration_boxplot <- plot_ly(x = everett_trips_duration$tripduration_minutes, type = "box", orientation = "h", name = "") %>%
layout(xaxis = list(range = c(0, max(everett_trips_duration$tripduration_minutes) * 1.1)),
title = "Everett Trip Duration Distribution")
everett_trip_duration_boxplot
The Blue Bikes dataset offers a comprehensive array of attributes encompassing both metropolitan and user-specific data. While my primary analysis delves deeply into a select few attributes, I’ve incorporated some additional visualizations and insights across the user type and gender elements of the dataset. Below, you’ll find the tabs, each dedicated to these two distinct areas of the dataset, offering glimpses into the data.
One attribute of this dataset that I was interested in is
usertype. This column has two categories:
Subscriber and Customer. Subscribers are either annual
or monthly members and Customers are casual single trio day pass users.
Here I will take a glimpse of the distribution of different user
types.
Findings: Although subscribers make up the greater proportion of total users (79.2% of the data), it looks like Customers have a greater on average trip duration (in minutes). After looking into the Blue Bikes pricing, it looks like Annual members include unlimited 45 minute rides at a time with an extra cost for additional time. Day pass users can get 24 hour access, 2 hour rides at a time, for a fixed amount. This may explain the increased time duration when compared to annual members.
user_data_count <- bluebikes %>%
group_by(usertype) %>%
summarize(count = n())
user_data_prop <- prop.table(table(bluebikes$usertype))*100
user_data_prop <- as.data.frame(user_data_prop)
names(user_data_prop) <- c("usertype", "Proportion")
combined_user_data <- merge(user_data_count, user_data_prop, by = "usertype")
fig_user_pie <- combined_user_data %>% plot_ly(labels = ~usertype, values = ~count)
fig_user_pie <- fig_user_pie %>% add_pie(hole = 0.6)
kable(combined_user_data, format = "markdown")
| usertype | count | Proportion |
|---|---|---|
| Customer | 521254 | 20.79699 |
| Subscriber | 1985138 | 79.20301 |
fig_user_pie
Bellow you will find a table of average trip duration by month for each user type. The distribution is a timeseries of time duration throughout the year by user type.
user_data <- bluebikes %>%
group_by(month, usertype) %>%
summarize(avg_trip_duration = mean(tripduration_minutes)) %>%
group_by(usertype)
user_data <- user_data %>%
pivot_wider(names_from = usertype, values_from = avg_trip_duration)
user_data <- user_data %>%
arrange(month)
fig_user_type <- plot_ly(data = user_data, x = ~month) %>%
add_trace(y = ~Customer, name = "Customer", type = "scatter", mode = "lines") %>%
add_trace(y = ~Subscriber, name = "Subscriber", type = "scatter", mode = "lines") %>%
layout(xaxis = list(title = "Date", tickmode = "array",
tickvals = 1:12,
ticktext = month.abb),
yaxis = list(title = "Average Trip Duration (minutes)"))
kable(user_data, format = "markdown")
| month | Customer | Subscriber |
|---|---|---|
| 1 | 24.69920 | 11.81342 |
| 2 | 25.68654 | 11.68907 |
| 3 | 30.41976 | 12.24433 |
| 4 | 31.75415 | 12.88817 |
| 5 | 30.05920 | 13.40862 |
| 6 | 29.45022 | 14.37375 |
| 7 | 30.23915 | 14.31636 |
| 8 | 29.34373 | 14.11651 |
| 9 | 26.82751 | 13.45989 |
| 10 | 24.74770 | 12.67666 |
| 11 | 22.70492 | 12.18640 |
| 12 | 21.86632 | 12.09282 |
fig_user_type
Taking a glimpse of the distribution of gender in the Blue Bikes
dataset, we notice that men make a significant proportion of the
population data. I would like to note that gender is self
reported and I decided to omit the unknown genders.
gender_data <- bluebikes %>%
filter(gender != 0) %>%
group_by(gender) %>%
mutate(gender = case_when(gender == 1 ~ "Male", gender == 2 ~ "Female")) %>%
summarize(count = n())
colors <- c('rgb(20,2,211)', 'rgb(128,133,133)')
gender_dist_pie <- plot_ly(gender_data,
labels = ~gender, values = ~count,
type = 'pie')
gender_dist_pie
The distribution of genders across each district stay relatively consistent to what we found in the population.
gender_district_data <- bluebikes %>%
filter(gender != 0) %>%
filter(!is.na(start_district)) %>%
group_by(gender, start_district) %>%
mutate(gender = case_when(gender == 1 ~ "Male", gender == 2 ~ "Female")) %>%
summarize(count = n())
gender_district_data <- gender_district_data %>%
pivot_wider(names_from = gender, values_from = count)
gender_district_plot <- plot_ly(gender_district_data, x = ~start_district, y = ~Male, type = 'bar', name = 'Male')
gender_district_plot <- gender_district_plot %>% add_trace(y = ~Female, name = 'Female')
gender_district_plot <- gender_district_plot %>% layout(yaxis = list(title = 'Count'), barmode = 'stack')
gender_district_plot
The Central Limit Theorem (CLT) is one of many important theorems in statistics. In the Central Limit Theorem, the distribution of \(\bar{x}\), the mean of the samples of a given size, is referred to as the sampling distribution of the sample mean. CLT tells us that the sampling distribution of the sample means becomes more and more normal as the sample size, n increases, regardless of the shape of the original population distribution.
Before we delve into applying the Central Limit Theorem, lets take a
closer look at the attribute we will be using:
tripduration_minutes. This is a variable I created using
the datasets columns tripduration which is the duration of
a single bike trip in seconds. Since the dataset we are using only take
a look at the trips under 2 hours, our max range is 120 minutes. Our
minimum is 2 minutes. This means the distribution has a spread from 2
minutes long to 120 minutes long. Lets take a look at a summary of this
data:
summary_tripdur <- summary(bluebikes$tripduration_minutes)
summary_tripdur
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 7.00 12.00 16.38 20.00 120.00
I have conducted an analysis of applying the Central Limit Theorem to gain a deeper understanding of the distribution of trip duration (in minutes). If we take a deeper look into Trip Duration for bike rides that were 2 hours or less, we can see an exponential distribution, specifically, a right skewed distribution. The mean trip duration (in minutes) is about 16.38 minutes and the standard deviation is about 14.65. The mean is represented by the dashed vertical line. We find that the proportion of bike rides under an hour long make up the majority of our data set.
hist(bluebikes$tripduration_minutes, prob = TRUE, col = "#1f77b4", border = "black",
main = "Trip Duration Histogram",
xlab = "Trip Duration (minutes)",
ylab = "Density",
ylim = c(0,0.07),xlim = c(0, max(bluebikes$tripduration_minutes)))
mean_duration <- mean(bluebikes$tripduration_minutes)
lines(density(bluebikes$tripduration_minutes), col = "red", lwd = 2)
abline(v = mean_duration, col = "violet", lwd = 2, lty = 2)
legend("topright", legend = c("Histogram", "Density Curve", "Mean"),
fill = c("#1f77b4", "red", "violet"),
title = "Key")
The average trip duration in minutes is: 16.3843
The standard deviation of trip duration in minutes is: 14.64135
From what we found in the density distribution of Trip Duration in Minutes, we know that the distribution is exponential and the Central Limit Theorem still holds true even if the data is not from a normal distribution. In this part of our analysis, we will be analyzing the distribution of 1000 random samples of sample sizes 5, 10, 20, and 40.
Findings: While our original Blue Bike data shows a right-skewed or exponential distribution, it’s important to note that with larger sample sizes, the distribution tends to become more normal. Upon closer inspection, you’ll observe that the standard deviation decreases and the distribution narrows as the sample size increases. The mean of the distributions remains around the mean of the data on trip duration in minutes. There is a vast contrast in distributions between a sample size of 5 vs a sample size of 40.
This observation regarding the effect of sample size on the distribution aligns with the Central Limit Theorem (CLT). In the context of this analysis, as sample size of trip durations is increased, the distribution of the sample means becomes more normal, with smaller standard deviation and narrower distribution.
set.seed(1000)
par(mfrow = c(2,2))
samples <- 1000
xbar <- numeric(samples)
for (size in c(5, 10, 20, 40)) {
for (i in 1:samples) {
xbar[i] <- mean(sample(bluebikes$tripduration_minutes, size, replace = FALSE))
}
hist(xbar, prob = TRUE, col = "#1f77b4", border = "black",
main ="",
xlim = c(0, 60),
ylim = c(0, .2))
text(40, 0.18, paste("Sample Size =", size))
text(40, 0.14, paste("Mean =", round(mean(xbar), 2)))
text(40, 0.10, paste("SD =", round(sd(xbar), 2)))
}
In statistics, there are several sampling methods used for data analysis. In this part of our analysis, we’ll examine different sampling techniques and how they represent the Blue Bikes district data. For the purposes of this analysis, we will exclude the district of Everett, as it was introduced only in Spring 2019 and has a very low proportion in the dataset. The sampling methods we will use include Simple Random Sampling With Replacement, Stratified Random Sampling, and Systematic Sampling.
Findings: After collecting samples using these methods, we found that Stratified Random Sampling provided the closest representation of the actual distribution of total bike rides by district. With Stratified Random Sampling, each subgroup/strata (the districts) of the population is ensured to be represented in the sample and this can lead to a more accurate depiction of the population. In addition to finding the best sampling method for our population, we also found the worst. Systematical Sampling may have introduced a sampling error- we can see that Cambridge has a significantly higher proportion when compared to the population distribution. It is known that if there are hidden patterns in the population data, there is also a higher chance of sampling errors with this method. In conclusion, if Stratified Random Sampling was used instead of the whole dataset, I believe that it would provide an accurate representation of our population data. I would not, however, reccomend using Systematical Sampling due to the patterns we have found in our analysis.
| District | Population_Proportion | SRS_Proportion | SS_Proportion | SRSWR_Proportion |
|---|---|---|---|---|
| Boston | 55.108528 | 55.0 | 42.5 | 50.0 |
| Brookline | 2.579433 | 2.5 | 2.5 | 5.0 |
| Cambridge | 36.956820 | 37.5 | 45.0 | 37.5 |
| Somerville | 5.355219 | 5.0 | 10.0 | 7.5 |
Take a closer look at each sampling method by selecting one of the tabs below
Firstly, lets take a look at the distribution of Bike Share Rides in our population data (omitting Everett). We see that there the order from highest proportion to lowest is as follows: Boston, Cambridge, Somerville and lastly Brookline.
| District | Count | Proportion |
|---|---|---|
| Boston | 1233228 | 55.108528 |
| Brookline | 57723 | 2.579433 |
| Cambridge | 827026 | 36.956820 |
| Somerville | 119840 | 5.355219 |
| District | Count | Proportion |
|---|---|---|
| Boston | 22 | 55.0 |
| Brookline | 1 | 2.5 |
| Cambridge | 15 | 37.5 |
| Somerville | 2 | 5.0 |
| District | Count | Proportion |
|---|---|---|
| Boston | 17 | 42.5 |
| Brookline | 1 | 2.5 |
| Cambridge | 18 | 45.0 |
| Somerville | 4 | 10.0 |
| District | Count | Proportion |
|---|---|---|
| Boston | 20 | 50.0 |
| Brookline | 2 | 5.0 |
| Cambridge | 15 | 37.5 |
| Somerville | 3 | 7.5 |
Bellow you will find a map with all the unique stations in the Blue Bikes dataset. I used the leaflet package in R to help me generate this map. To learn more about leaflet, I looked at documentation found on geeksforgeeks.
bluebikes_map_data <- bluebikes %>% # removing NA values from start_district
filter(!is.na(start_district))
point <- unique(data.frame(lat = bluebikes_map_data$`start station latitude`, long = bluebikes_map_data$`start station longitude`))
latitude <- point$lat
longitude <- point$long
blue_bike_station_map <- leaflet() %>%
addProviderTiles("CartoDB.Voyager") %>%
addMarkers(lat = 42.361145, lng = -71.057083, popup = "Blue Bike Stations") # Boston coordinates
for (i in 1:length(longitude)) {
blue_bike_station_map <- addMarkers(map = blue_bike_station_map, lng = longitude[i], lat = latitude[i])
}
blue_bike_station_map
This analysis primarily focuses on Blue Bike trip duration, district data, and bike rentals throughout the year. We found that the total daily bike rides throughout the year follow a similar pattern regardless of the district at the start of the trip, with lower volumes in colder Boston months and higher volumes in the warmer months. Although Everett had the lowest daily trip rides (most likely due to its introduction in Spring 2019), it had the highest average trip duration. It is interesting to see the stark differences in distributions of user types. Customers/non-subscribing members typically have longer trips compared to subscribers; however, they make up much less of the data than subscribers.
We found that our observation of the Central Limit Theorem aligns with what we expected to happen based on our understanding of the theorem: as the sample size increased, the standard deviation decreased, and the distribution became narrower. Additionally, Stratified Random Sampling provided the closest representation of the Blue Bikes data we were working with.
Some problems I encountered with this data were NA values. Some column information, for example, gender, was based on user input data, so it may not have been an accurate representation of Blue Bike usage.